The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages
Gender biases in language generation systems are challenging to mitigate. One
possible source for these biases is gender representation disparities in the
training and evaluation data. Despite recent progress in documenting this
problem and many attempts at mitigating it, we still lack shared methodology
and tooling to report gender representation in large datasets. Such
quantitative reporting will enable further mitigation, e.g., via data
augmentation. This paper describes the Gender-GAP Pipeline (for Gender-Aware
Polyglot Pipeline), an automatic pipeline to characterize gender representation
in large-scale datasets for 55 languages. The pipeline uses a multilingual
lexicon of gendered person-nouns to quantify the gender representation in text.
We showcase it to report gender representation in WMT training data and
development data for the News task, confirming that current data is skewed
towards masculine representation. Unbalanced datasets may indirectly
optimize our systems to perform better for one gender than for the others. We
suggest introducing our gender quantification pipeline in current datasets and,
ideally, modifying them toward a balanced representation.
Comment: 15 page
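The lexicon-matching idea described in this abstract can be illustrated with a minimal Python sketch. This is not the Gender-GAP implementation: the tiny English lexicon, the function names `gender_counts` and `gender_shares`, the regex tokenisation, and the choice to normalise by total token count are all assumptions made for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy lexicon of gendered person-nouns (English only).
# The pipeline described above uses a multilingual lexicon; this one is made up.
LEXICON = {
    "masculine": {"man", "men", "boy", "boys", "father", "fathers", "son", "husband"},
    "feminine": {"woman", "women", "girl", "girls", "mother", "mothers", "daughter", "wife"},
    "unspecified": {"person", "people", "child", "children", "parent", "parents"},
}


def gender_counts(text, lexicon=LEXICON):
    """Count lexicon matches per gender class in a piece of text."""
    # Lowercase, then keep runs of letters only (a simplistic tokeniser).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for token in tokens:
        for gender, nouns in lexicon.items():
            if token in nouns:
                counts[gender] += 1
    result = {"total_tokens": len(tokens)}
    for gender in lexicon:
        result[gender] = counts[gender]
    return result


def gender_shares(text, lexicon=LEXICON):
    """Express each gender class as a fraction of all tokens (an assumed normalisation)."""
    counts = gender_counts(text, lexicon)
    total = counts["total_tokens"]
    return {g: (counts[g] / total if total else 0.0) for g in lexicon}


sample = "The man and his father met two women and a child."
print(gender_counts(sample))
print(gender_shares(sample))
```

Running such counts over a whole corpus, one language at a time, would give the kind of per-dataset gender-representation report the abstract refers to, although a real lexicon and tokeniser would need to handle morphology and multiword terms.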
SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded systems that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
Filtered and combined with human-labeled and pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and
added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communicatio